Homogeneity Score (homogeneity_score)#

Homogeneity is an external clustering metric: it scores how pure each predicted cluster is with respect to ground-truth class labels.

Intuition: If I open a cluster, do I mostly see one class?

  • Perfectly pure clusters → score = 1.0

  • Completely mixed clusters (clusters don’t help predict the class) → score ≈ 0.0


Learning goals#

By the end you should be able to:

  • explain homogeneity in terms of entropy

  • compute it from a contingency matrix (class × cluster counts)

  • implement homogeneity_score from scratch in NumPy

  • visualize what increases / decreases the score

  • use it to tune a simple clustering algorithm (with caveats)


Quick import#

from sklearn.metrics import homogeneity_score
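
A quick check of the two extremes, with tiny hand-checkable inputs:

print(homogeneity_score([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0: each cluster is pure
print(homogeneity_score([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0: every cluster is a 50/50 mix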

Table of contents#

  1. Intuition: purity vs completeness

  2. The math: entropy & conditional entropy

  3. NumPy implementation (from scratch)

  4. Worked toy example + plots

  5. How mixing affects homogeneity

  6. Pitfall: over-segmentation

  7. Using homogeneity to tune k-means (grid search)

  8. Pros/cons + when to use

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    completeness_score as sk_completeness_score,
    homogeneity_score as sk_homogeneity_score,
    v_measure_score as sk_v_measure_score,
)

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)

1) Intuition: purity vs completeness#

Homogeneity cares about purity inside each predicted cluster.

  • If a cluster contains multiple ground-truth classes, it’s impure → homogeneity goes down.

  • If a ground-truth class gets split across many clusters, homogeneity does not complain.

That second point is why homogeneity is often paired with completeness:

  • Homogeneity: each cluster contains only members of a single class.

  • Completeness: all members of a given class are assigned to the same cluster.

Together they are summarized by the V-measure, their harmonic mean.

A key property: homogeneity is label-permutation invariant. If you relabel clusters (e.g., swap cluster 0 and 1), the score doesn’t change.
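
A two-line demonstration (a tiny made-up labeling; any relabeling of the clusters yields the same score):

y_demo_true = [0, 0, 1, 1, 2, 2]
y_demo_pred = [0, 0, 1, 1, 1, 2]  # cluster 1 mixes classes 1 and 2
print(homogeneity_score(y_demo_true, y_demo_pred))
print(homogeneity_score(y_demo_true, [(k + 1) % 3 for k in y_demo_pred]))  # identical score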

2) The math: entropy & conditional entropy#

We have:

  • ground-truth class labels: \(c \in \{1,\dots,C\}\) (random variable \(C\))

  • predicted cluster labels: \(k \in \{1,\dots,K\}\) (random variable \(K\))

2.1 Contingency matrix#

Let the contingency matrix \(N \in \mathbb{N}^{C\times K}\) count co-occurrences:

\[ N_{c,k} = \#\{i: y_i = c, \; \hat y_i = k\}. \]

Define totals:

  • \(n = \sum_{c,k} N_{c,k}\)

  • class counts: \(n_c = \sum_k N_{c,k}\)

  • cluster counts: \(n_k = \sum_c N_{c,k}\)
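
To make the counts concrete, here is a 2-class, 2-cluster example. (scikit-learn also ships a helper for this, sklearn.metrics.cluster.contingency_matrix, used here just to show the counts.)

from sklearn.metrics.cluster import contingency_matrix

# y = (A, A, B, B), predicted clusters (0, 1, 1, 1):
# class A lands once in cluster 0 and once in cluster 1; class B twice in cluster 1.
print(contingency_matrix(["A", "A", "B", "B"], [0, 1, 1, 1]))
# [[1 1]
#  [0 2]]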

2.2 Entropy#

The entropy of the class variable is

\[ H(C) = -\sum_{c=1}^C p(c)\,\log p(c), \qquad p(c)=\frac{n_c}{n}. \]

2.3 Conditional entropy#

The conditional entropy of classes given clusters is

\[ H(C\mid K) = \sum_{k=1}^K p(k)\,H(C\mid K=k) = -\sum_{k=1}^K\sum_{c=1}^C p(c,k)\,\log p(c\mid k), \]

where

\[ p(k)=\frac{n_k}{n},\quad p(c,k)=\frac{N_{c,k}}{n},\quad p(c\mid k)=\frac{N_{c,k}}{n_k}. \]

2.4 Homogeneity score#

Homogeneity is defined as

\[ h = 1 - \frac{H(C\mid K)}{H(C)}. \]

Edge case: if \(H(C)=0\) (all points belong to one class), homogeneity is defined as 1.0.

Interpretation:

  • \(H(C\mid K)=0\) ⇒ each cluster determines the class perfectly ⇒ \(h=1\)

  • \(H(C\mid K)=H(C)\) ⇒ clusters tell you nothing about the class ⇒ \(h=0\)

Note: the log base cancels in the ratio, so you can use natural log.

A nice identity (using mutual information \(I(C;K)\)):

\[ h = \frac{I(C;K)}{H(C)}. \]

So homogeneity is the fraction of class entropy explained by the clustering.
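
Plugging the tiny contingency matrix from 2.1 into these formulas (natural log):

\[ N = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}, \qquad H(C) = \log 2 \approx 0.693, \]

\[ H(C\mid K) = \tfrac{1}{4}\cdot 0 + \tfrac{3}{4}\,H\!\left(\tfrac13,\tfrac23\right) \approx \tfrac{3}{4}\times 0.637 \approx 0.477, \qquad h = 1 - \frac{0.477}{0.693} \approx 0.31. \]

Cluster 0 is pure (zero entropy), but the larger cluster 1 mixes both classes, so homogeneity lands well below 1.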

3) NumPy implementation (from scratch)#

We’ll implement:

  • a contingency matrix builder (any label types)

  • entropy + conditional entropy from counts

  • homogeneity_score using the definition above

def encode_labels(y):
    '''Map arbitrary labels to integer ids 0..(m-1).'''
    y = np.asarray(y)
    classes, y_idx = np.unique(y, return_inverse=True)
    return classes, y_idx


def contingency_matrix_np(y_true, y_pred):
    '''Contingency matrix N with N[c,k] = count(true=c, pred=k).'''
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    true_labels, true_idx = encode_labels(y_true)
    pred_labels, pred_idx = encode_labels(y_pred)

    n_classes = true_labels.size
    n_clusters = pred_labels.size

    N = np.zeros((n_classes, n_clusters), dtype=int)
    np.add.at(N, (true_idx, pred_idx), 1)

    return N, true_labels, pred_labels


def entropy_from_counts(counts: np.ndarray) -> float:
    '''Shannon entropy of a discrete distribution given counts.'''
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        return 0.0

    p = counts[counts > 0] / total
    return float(-(p * np.log(p)).sum())


def conditional_entropy_C_given_K_from_contingency(N: np.ndarray) -> float:
    '''Compute H(C|K) from contingency matrix N (classes x clusters).'''
    N = np.asarray(N, dtype=float)
    n = N.sum()
    if n <= 0:
        return 0.0

    n_k = N.sum(axis=0, keepdims=True)  # (1, K)

    # H(C|K) = - sum_{c,k} p(c,k) log p(c|k)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_ck = N / n
        p_c_given_k = np.divide(N, n_k, out=np.zeros_like(N), where=n_k > 0)  # out= keeps empty-cluster entries at 0 instead of uninitialized
        terms = np.where(N > 0, p_ck * np.log(p_c_given_k), 0.0)

    return float(-terms.sum())


def homogeneity_score_np(y_true, y_pred) -> float:
    '''Homogeneity score in [0,1]. Matches sklearn's definition.'''
    N, _, _ = contingency_matrix_np(y_true, y_pred)

    H_C = entropy_from_counts(N.sum(axis=1))
    if H_C == 0.0:
        return 1.0

    H_C_given_K = conditional_entropy_C_given_K_from_contingency(N)
    h = 1.0 - H_C_given_K / H_C

    # Numerical safety
    return float(np.clip(h, 0.0, 1.0))
# Quick sanity check vs scikit-learn

y_true = rng.integers(0, 4, size=500)
y_pred = rng.integers(0, 7, size=500)

h_np = homogeneity_score_np(y_true, y_pred)
h_sk = sk_homogeneity_score(y_true, y_pred)

print("homogeneity (numpy): ", h_np)
print("homogeneity (sklearn):", h_sk)
print("abs diff:", abs(h_np - h_sk))

# Edge case: one true class -> defined as 1.0
print("one-class edge case:", homogeneity_score_np(np.zeros(20), rng.integers(0, 3, size=20)))
homogeneity (numpy):  0.007663540181941153
homogeneity (sklearn): 0.007663540181940872
abs diff: 2.8102520310824275e-16
one-class edge case: 1.0

4) Worked toy example + plots#

Let’s build a small example and look at:

  • the contingency matrix

  • per-cluster class proportions

  • per-cluster class entropy (how “mixed” each cluster is)

y_true_toy = np.array([
    "A", "A", "A", "A", "A",
    "B", "B", "B", "B",
    "C", "C", "C", "C",
])

# Clusters are somewhat mixed:
# - cluster 0: mostly A
# - cluster 1: mix of A and B
# - cluster 2: pure C
# - cluster 3: mix of B and C
y_pred_toy = np.array([
    0, 0, 0, 1, 1,
    1, 1, 3, 3,
    2, 2, 2, 3,
])

N_toy, classes_toy, clusters_toy = contingency_matrix_np(y_true_toy, y_pred_toy)

h_toy = homogeneity_score_np(y_true_toy, y_pred_toy)

print("classes:", classes_toy)
print("clusters:", clusters_toy)
print("contingency N (rows=class, cols=cluster):")
print(N_toy)
print("homogeneity:", h_toy)

fig = px.imshow(
    N_toy,
    x=[f"cluster {k}" for k in clusters_toy],
    y=[f"class {c}" for c in classes_toy],
    text_auto=True,
    color_continuous_scale="Blues",
    title=f"Toy contingency matrix (homogeneity={h_toy:.3f})",
    labels={"x": "predicted cluster", "y": "true class", "color": "count"},
)
fig.update_layout(coloraxis_showscale=False)
fig.show()
classes: ['A' 'B' 'C']
clusters: [0 1 2 3]
contingency N (rows=class, cols=cluster):
[[3 2 0 0]
 [0 2 0 2]
 [0 0 3 1]]
homogeneity: 0.6704302058675669
# Per-cluster class proportions and per-cluster entropy

cluster_sizes = N_toy.sum(axis=0)
proportions = np.divide(N_toy, cluster_sizes, out=np.zeros(N_toy.shape), where=cluster_sizes > 0)

cluster_entropies = np.array([entropy_from_counts(N_toy[:, k]) for k in range(N_toy.shape[1])])

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Class proportions within each cluster", "Entropy within each cluster"),
)

# stacked bars (proportions)
for i, c in enumerate(classes_toy):
    fig.add_trace(
        go.Bar(
            x=[f"cluster {k}" for k in clusters_toy],
            y=proportions[i],
            name=f"class {c}",
        ),
        row=1,
        col=1,
    )

fig.update_yaxes(title_text="proportion", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="cluster", row=1, col=1)

# entropies
fig.add_trace(
    go.Bar(
        x=[f"cluster {k}" for k in clusters_toy],
        y=cluster_entropies,
        name="entropy",
        marker_color="gray",
    ),
    row=1,
    col=2,
)

fig.update_yaxes(title_text="H(C | K=k)", row=1, col=2)
fig.update_xaxes(title_text="cluster", row=1, col=2)

fig.update_layout(barmode="stack", title_text="What makes homogeneity go up/down")
fig.show()

5) How mixing affects homogeneity#

Consider a binary problem with two equally common classes.

We’ll create cluster labels by copying the true labels and then flipping each one independently with probability \(\varepsilon\).

  • \(\varepsilon = 0\) ⇒ perfectly pure clusters ⇒ homogeneity = 1

  • larger \(\varepsilon\) ⇒ more mixing inside clusters ⇒ homogeneity drops

def flip_fraction(y, eps: float, rng: np.random.Generator) -> np.ndarray:
    y = np.asarray(y, dtype=int)
    if not (0.0 <= eps <= 1.0):
        raise ValueError("eps must be in [0,1]")

    y_pred = y.copy()
    flip = rng.random(size=y.size) < eps
    y_pred[flip] = 1 - y_pred[flip]
    return y_pred


n = 2000
# perfectly balanced classes
true_bin = np.r_[np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)]
rng.shuffle(true_bin)

eps_grid = np.linspace(0.0, 0.5, 51)
h_values = []

for eps in eps_grid:
    pred_bin = flip_fraction(true_bin, eps=float(eps), rng=rng)
    h_values.append(homogeneity_score_np(true_bin, pred_bin))

fig = go.Figure()
fig.add_trace(go.Scatter(x=eps_grid, y=h_values, mode="lines+markers", name="homogeneity"))
fig.update_layout(
    title="Homogeneity vs label mixing (binary flip noise)",
    xaxis_title="flip fraction ε",
    yaxis_title="homogeneity",
    yaxis_range=[0, 1.02],
)
fig.show()
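
Sanity check: under these assumptions (balanced classes, independent symmetric flips) the curve has a closed form. Each cluster contains its majority class with probability \(1-\varepsilon\), so \(H(C\mid K) \approx H_b(\varepsilon)\) (the binary entropy) while \(H(C) = \log 2\), giving expected homogeneity \(h(\varepsilon) \approx 1 - H_b(\varepsilon)/\log 2\). A quick overlay:

def binary_entropy(p: np.ndarray) -> np.ndarray:
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid 0 * log(0) at the endpoints
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


h_theory = 1.0 - binary_entropy(eps_grid) / np.log(2)

fig = go.Figure()
fig.add_trace(go.Scatter(x=eps_grid, y=h_values, mode="markers", name="empirical"))
fig.add_trace(go.Scatter(x=eps_grid, y=h_theory, mode="lines", name="theory: 1 - H_b(ε)/log 2"))
fig.update_layout(
    title="Flip noise: empirical homogeneity vs closed form",
    xaxis_title="flip fraction ε",
    yaxis_title="homogeneity",
    yaxis_range=[0, 1.02],
)
fig.show()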

6) Pitfall: over-segmentation can reach 1.0#

Homogeneity ignores whether a class is split across many clusters.

If each class is divided into multiple sub-clusters (all pure), homogeneity stays 1.0, even though the clustering is often less useful.

We’ll demonstrate this by taking \(C=3\) classes and splitting each class into \(m\) pure clusters.

We’ll also show completeness and V-measure for contrast.

C = 3
n_per_class = 400

y_true = np.repeat(np.arange(C), n_per_class)
rng.shuffle(y_true)


def split_each_class_into_m_clusters(y_true, m: int, rng: np.random.Generator) -> np.ndarray:
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.empty_like(y_true)

    for c in range(np.max(y_true) + 1):
        idx = np.where(y_true == c)[0]
        sub = rng.integers(0, m, size=idx.size)
        y_pred[idx] = c * m + sub

    return y_pred


m_grid = np.arange(1, 21)

h_list = []
comp_list = []
v_list = []

for m in m_grid:
    y_pred = split_each_class_into_m_clusters(y_true, m=int(m), rng=rng)
    h_list.append(homogeneity_score_np(y_true, y_pred))
    comp_list.append(sk_completeness_score(y_true, y_pred))
    v_list.append(sk_v_measure_score(y_true, y_pred))

fig = go.Figure()
fig.add_trace(go.Scatter(x=m_grid, y=h_list, mode="lines+markers", name="homogeneity"))
fig.add_trace(go.Scatter(x=m_grid, y=comp_list, mode="lines+markers", name="completeness"))
fig.add_trace(go.Scatter(x=m_grid, y=v_list, mode="lines+markers", name="v-measure"))

fig.update_layout(
    title="Over-segmentation: splitting each class into m pure clusters",
    xaxis_title="m (clusters per true class)",
    yaxis_title="score",
    yaxis_range=[0, 1.02],
)
fig.show()
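
7) Using homogeneity to tune k-means (grid search)#

The KMeans and make_blobs imports at the top are for this step. Below is a minimal sketch, with illustrative (not tuned) blob parameters: fit k-means for a grid of k, then score each fit against the ground-truth labels. Given the pitfall in section 6, homogeneity alone tends to keep rising as k grows, so we track v-measure alongside it and would pick k from the combined score, not from homogeneity by itself.

X_blobs, y_blobs = make_blobs(n_samples=900, centers=4, cluster_std=1.5, random_state=7)

k_grid = np.arange(2, 11)
h_scores, v_scores = [], []

for k in k_grid:
    labels = KMeans(n_clusters=int(k), n_init=10, random_state=7).fit_predict(X_blobs)
    h_scores.append(sk_homogeneity_score(y_blobs, labels))
    v_scores.append(sk_v_measure_score(y_blobs, labels))

fig = go.Figure()
fig.add_trace(go.Scatter(x=k_grid, y=h_scores, mode="lines+markers", name="homogeneity"))
fig.add_trace(go.Scatter(x=k_grid, y=v_scores, mode="lines+markers", name="v-measure"))
fig.update_layout(
    title="Grid search over k: homogeneity favors large k, v-measure balances it",
    xaxis_title="k (number of clusters)",
    yaxis_title="score",
    yaxis_range=[0, 1.02],
)
fig.show()

best_k = int(k_grid[int(np.argmax(v_scores))])
print("best k by v-measure:", best_k)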

8) Pros/cons + when to use#

Pros#

  • Interpretable: measures “cluster purity”, which lines up with many real use cases

  • Bounded in [0, 1] and label-permutation invariant

  • Works for multiclass and imbalanced class distributions

  • Information-theoretic: connects to entropy and mutual information

Cons / pitfalls#

  • Requires ground-truth labels (so it’s not usable for truly unsupervised evaluation)

  • Ignores completeness → can be artificially high with many clusters (over-segmentation)

  • Not a smooth/differentiable objective (used for evaluation / selection, not gradient training)

  • Can hide issues in tiny impure clusters, since each cluster’s contribution is weighted by its size

Good use cases#

  • Benchmarking clustering when you have a gold standard (topics, categories, known segments)

  • Situations where mixing classes inside a cluster is especially harmful (you need “clean buckets”)

  • As part of V-measure (homogeneity + completeness) or alongside other external metrics (ARI, AMI)

References#

  • Rosenberg, A., & Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure.

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html

  • Related metrics: completeness_score, v_measure_score, adjusted_rand_score, adjusted_mutual_info_score

Exercises#

  1. Create a clustering with high homogeneity but low completeness. Verify with plots.

  2. Modify the toy example so one small cluster is very impure. How much does homogeneity change?

  3. Implement completeness_score from scratch and reproduce V-measure.